library(MASS)
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston data frame has 506 observations of 14 variables:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s

More information about the dataset can be found here.

## 3 Plotting the data
# plot matrix of the variables
pairs(Boston)
The pairs plot is a mess at this point, so let's see whether we can discard some variables first and then perhaps try drawing it again.
Let’s draw the correlations between the variables.
library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)
From the plot we can pick out the most correlated variable pairs. Let's list all pairs with an absolute correlation above 0.7:
library(reshape2)
CM <- cor_matrix # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
## Var1 Var2 value
## 59 indus nox 0.76
## 89 nox age 0.73
## 101 indus dis -0.71
## 103 nox dis -0.77
## 105 age dis -0.75
## 129 indus tax 0.72
## 135 rad tax 0.91
## 195 lstat medv -0.74
Variables rad and tax are highly positively correlated (0.91). This can be interpreted so that properties with better access to the city's radial highways also have a higher property-tax rate.
Next come nox and dis with a negative correlation of -0.77. The interpretation is that the smaller the weighted mean distance to the five Boston employment centres, the higher the nitrogen oxides concentration (worse air pollution).
Third, with a positive correlation of 0.76, are indus and nox. The interpretation is that the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.
(One thing that can be noted is that the status of the inhabitants and the number of rooms correlate strongly with the housing value.)
For convenience, let’s plot histograms only for these variables of interest.
library(ggpubr)
library(ggplot2)
rad_plot<-ggplot(Boston, aes(x=rad)) +
geom_histogram()
tax_plot <- ggplot(Boston, aes(x=tax)) +
geom_histogram()
nox_plot <- ggplot(Boston, aes(x=nox)) +
geom_histogram()
dis_plot <- ggplot(Boston, aes(x=dis)) +
geom_histogram()
indus_plot <- ggplot(Boston, aes(x=indus)) +
geom_histogram()
ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot,
labels = c("A", "B", "C", "D", "E"),
ncol = 3, nrow = 2)
Plot A: rad is the index of accessibility to radial highways. There are two concentrations in the graph, the first at index values 4-5 and the second at index 24. There are 132 observations in the latter (index value 24), which is a large share of all observations. It would make sense to check a map of Boston to better understand these outliers.
Plot B: tax shows the full-value property-tax rate, and the distribution is quite similar to plot A, i.e. a concentration in the first quartile and a large number of outliers at the top (132 observations with value 666).
Plots C and D (nox and dis) also have similar distributions, skewed to the right.
Plot E: indus is skewed to the right and has a high count (132 observations) at a single value (18.1).
So there seem to be 132 observations in the data that stand apart for some reason.
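As a quick sanity check (a base-R sketch, not part of the analysis above), we can count how many rows share the repeated values seen in the histograms and confirm they are the same rows:

```r
library(MASS)
data("Boston")
# rows with the highest highway-access index and the top tax rate
sum(Boston$rad == 24)                      # 132
sum(Boston$tax == 666)                     # 132
sum(Boston$rad == 24 & Boston$tax == 666)  # the same 132 rows
```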
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
The scaled dataset is of type matrix, so we'll convert it into a data frame. In the new data frame all the variables have zero mean and unit variance.
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
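Under the hood, `scale()` simply subtracts each column's mean and divides by its standard deviation. A minimal sketch verifying that for one variable:

```r
library(MASS)
data("Boston")
# scaled_x = (x - mean(x)) / sd(x)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston$crim)))  # TRUE
```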
Linear discriminant analysis produces results based on the assumptions that

- the variables are normally distributed (conditional on the classes),
- the normal distributions for each class share the same covariance matrix, and
- the variables are continuous.
Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high"))
# look at the table of the new factor crime
#table(crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
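Because the break points are the quantiles of crim, the four classes come out roughly equal in size. A self-contained sketch to confirm (exact counts left unstated, since interpolation at the break points can shift them by a row or two):

```r
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
table(crime)  # four classes of roughly 506/4 = 126-127 rows each
```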
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
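Since `sample()` makes the split random, a quick sketch of the resulting set sizes (the seed 123 here is arbitrary and only for reproducibility of this sketch; the split above did not set one, and `floor()` makes the truncation of 506 * 0.8 explicit):

```r
set.seed(123)                       # arbitrary seed, only for this sketch
n <- 506                            # number of rows in the Boston data
ind <- sample(n, size = floor(n * 0.8))
length(ind)                         # 404 rows go to the train set
n - length(ind)                     # 102 rows remain for the test set
```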
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
#lda.fit
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
From the graph we can see which variables imply high crime, the most important factor being rad (its arrow points at the high-crime cluster).
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 23 9 0 0
## med_low 6 11 8 0
## med_high 0 7 10 2
## high 0 0 0 26
The model predicts high crime very well (26 out of 26 correct), and looking at the predictions as a whole, the model does better on the higher crime rates. For every class, the model predicted correctly more often than not.
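From the printed confusion matrix we can also compute the overall accuracy. A sketch that re-enters the table by hand (the values depend on the random split, so treat them as illustrative):

```r
# rows = correct class, columns = predicted class (low, med_low, med_high, high)
conf <- matrix(c(23,  9,  0,  0,
                  6, 11,  8,  0,
                  0,  7, 10,  2,
                  0,  0,  0, 26), nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # overall accuracy: 70/102, about 0.69
```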
First we calculate the distances with the Euclidean metric, and then with the Manhattan metric.
library(factoextra)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
Let's visualise the distances between the observations.
fviz_dist(dist_eu, show_labels = TRUE, lab_size = 0.6, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
############
# K-means #
############
k_max <- 10
set.seed(123)
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
From the graph we choose to have two centers, as it is the point where the slope of the line changes from steep to level. Let’s run K-means with two centers.
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
library(tidyverse)
library(data.table)
boston_scaled %>%
mutate(Cluster = km$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean") %>%
DT::datatable(extensions = 'FixedColumns', options = list(dom = 't',
scrollX = TRUE,
scrollCollapse = TRUE))
You can scroll the datatable to see the rest of the variables, which are summarised by their means, grouped by cluster number. There are clear differences between the two clusters: e.g. cluster #1 has a higher crime rate, a larger proportion of non-retail business acres, more air pollution, older buildings, longer distances to the employment centres, fewer teachers per pupil and a lower median value of owner-occupied homes.
Let's draw the pairs plot coloured according to the two clusters:
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Red is cluster #1 and black is the nicer-neighbourhood cluster #2.
We can see from the graph what we earlier saw numerically from the datatable. In the upper row we have crime against the variables, and we can see that cluster #1 has higher crime rates in every aspect than cluster #2.
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)
If we compare crim and indus we see that the red cluster has higher crime and is more industrialised than the black one. It's the same with nox, and it's logical that there is also more air pollution. zn tells us that the red cluster has hardly any land zoned for large residential lots, unlike the black cluster.
We could go on and on analysing all the variables in the graph, but I'm certain that's not the point of this exercise.
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)
# plots to compare
p2 <- fviz_cluster(k3, geom = "point", data = boston_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = boston_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = boston_scaled) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point", data = boston_scaled) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point", data = boston_scaled) + ggtitle("k = 7")
library(gridExtra)
grid.arrange(p2, p3, p4, p5, p6, nrow = 3)
I choose k=3 as it has the best separation in my opinion.
set.seed(123)
km <- boston_scaled %>%
kmeans(centers = 3)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)
classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 4)
Rad and tax (and maybe indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. Black pulls towards clusters #1 and #3.